Neural Networks for Analysing Music and Environmental Audio
PhD
In this thesis, we consider the analysis of music and environmental audio
recordings with neural networks. Recently, neural networks have been
shown to be an effective family of models for speech recognition, computer
vision, natural language processing and a number of other statistical modelling
problems. The composite layer-wise structure of neural networks
allows for flexible model design, where prior knowledge about the domain
of application can be used to inform the design and architecture of the
neural network models. Additionally, it has been shown that when trained
on sufficient quantities of data, neural networks can be directly applied to
low-level features to learn mappings to high-level concepts like phonemes
in speech and object classes in computer vision. In this thesis we investigate
whether neural network models can be usefully applied to processing
music and environmental audio.
With regard to music signal analysis, we investigate two different problems.
The first problem, automatic music transcription, aims to identify the
score or the sequence of musical notes that comprise an audio recording.
We also consider the problem of automatic chord transcription, where the
aim is to identify the sequence of chords in a given audio recording. For
both problems, we design neural network acoustic models which are applied
to low-level time-frequency features in order to detect the presence of
notes or chords. Our results demonstrate that the neural network acoustic
models perform similarly to state-of-the-art acoustic models, without the
need for any feature engineering. The networks are able to learn complex
transformations from time-frequency features to the desired outputs, given
sufficient amounts of training data. Additionally, we use recurrent neural
networks to model the temporal structure of sequences of notes or chords,
similar to language modelling in speech. Our results demonstrate that
the combination of the acoustic and language model predictions yields
improved performance over the acoustic models alone. We also observe
that convolutional neural networks yield better performance compared to
other neural network architectures for acoustic modelling.
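To make this setup concrete, the following is a minimal sketch, not the thesis implementation, of a convolutional acoustic model that maps a context window of time-frequency frames to per-note activation probabilities; the input size, layer sizes, and 88-note output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvAcousticModel(nn.Module):
    """Maps a spectrogram context window to per-note probabilities."""
    def __init__(self, n_bins=229, context=7, n_notes=88):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),  # pool along the frequency axis only
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * context * (n_bins // 4), 256), nn.ReLU(),
            nn.Linear(256, n_notes),  # one independent output per note
        )

    def forward(self, x):              # x: (batch, 1, context, n_bins)
        return torch.sigmoid(self.fc(self.conv(x)))

model = ConvAcousticModel()
frames = torch.randn(16, 1, 7, 229)    # batch of spectrogram context windows
note_probs = model(frames)             # (16, 88) note activation probabilities
```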
For the analysis of environmental audio recordings, we consider the problem
of acoustic event detection. Acoustic event detection has a similar
structure to automatic music and chord transcription, where the system
is required to output the correct sequence of semantic labels along with
onset and offset times. We compare the performance of neural network
architectures against Gaussian mixture models and support vector machines.
In order to account for the fact that such systems are typically
deployed on embedded devices, we compare performance as a function of
the computational cost of each model. We evaluate the models on two large
datasets of real-world recordings of baby cries and smoke alarms. Our results
demonstrate that the neural networks clearly outperform the other models,
and that they do so without incurring a heavy computational cost.
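To make the shared structure of these detection tasks concrete, here is a minimal sketch, assuming frame-wise posteriors from an acoustic model, of how per-frame probabilities can be turned into events with onset and offset times; the threshold and hop size are illustrative values, not those used in the thesis.

```python
import numpy as np

def posteriors_to_events(probs, threshold=0.5, hop_seconds=0.01):
    """Return (onset, offset) times for runs of frames above threshold."""
    active = probs >= threshold
    # Rising and falling edges of the active mask mark onsets and offsets.
    edges = np.diff(active.astype(int), prepend=0, append=0)
    onsets = np.where(edges == 1)[0] * hop_seconds
    offsets = np.where(edges == -1)[0] * hop_seconds
    return list(zip(onsets, offsets))

probs = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.1, 0.6, 0.8, 0.2])
print(posteriors_to_events(probs))   # [(0.02, 0.05), (0.07, 0.09)]
```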
Automatic Environmental Sound Recognition: Performance versus Computational Cost
In the context of the Internet of Things (IoT), sound sensing applications
are required to run on embedded platforms where notions of product pricing and
form factor impose hard constraints on the available computing power. Whereas
Automatic Environmental Sound Recognition (AESR) algorithms are most often
developed with limited consideration for computational cost, this article
investigates which AESR algorithm can make the most of a limited amount of
computing power by comparing sound classification performance as a function of
computational cost. Results suggest that Deep Neural Networks yield the best
trade-off between sound classification accuracy and computational cost across
the range of costs considered, while Gaussian Mixture Models offer reasonable
accuracy at a consistently small cost, and Support Vector Machines sit between
the two in their compromise between accuracy and computational cost.
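As a back-of-the-envelope illustration of how such a comparison can be framed, the sketch below counts multiply-accumulate operations per input frame for each model family; the model sizes are arbitrary examples, not those evaluated in the article.

```python
def dnn_macs(layer_sizes):
    """Fully connected DNN: one weight matrix per pair of layers."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

def gmm_macs(n_components, n_features):
    """Diagonal-covariance GMM: ~2 multiply-adds per feature per component."""
    return 2 * n_components * n_features

def svm_macs(n_support_vectors, n_features):
    """Kernel SVM: one dot product per support vector."""
    return n_support_vectors * n_features

print(dnn_macs([40, 128, 128, 2]))   # 21,760 MACs per frame
print(gmm_macs(32, 40))              # 2,560 MACs per frame
print(svm_macs(500, 40))             # 20,000 MACs per frame
```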
Multi-task Learning for Speaker Verification and Voice Trigger Detection
Automatic speech transcription and speaker recognition are usually treated as
separate tasks even though they are interdependent. In this study, we
investigate training a single network to perform both tasks jointly. We train
the network in a supervised multi-task learning setup, where the speech
transcription branch of the network is trained to minimise a phonetic
connectionist temporal classification (CTC) loss while the speaker recognition
branch of the network is trained to assign the correct speaker label to the
input sequence. We present a large-scale empirical study where the model
is trained using several thousand hours of labelled training data for each
task. We evaluate the speech transcription branch of the network on a voice
trigger detection task while the speaker recognition branch is evaluated on a
speaker verification task. Results demonstrate that the network is able to
encode both phonetic and speaker information in its learnt
representations while yielding accuracies at least as good as the baseline
models for each task, with the same number of parameters as the independent
models.
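The following is a minimal sketch, not the paper's implementation, of this kind of multi-task objective: a shared encoder feeds a phonetic head trained with a CTC loss and a speaker head trained with a cross-entropy speaker-label loss. All sizes, the number of phone classes, and the speaker count are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Shared encoder with two task-specific heads (all sizes illustrative).
encoder = nn.LSTM(input_size=40, hidden_size=256, batch_first=True)
phone_head = nn.Linear(256, 42)      # 41 phone classes + 1 CTC blank
speaker_head = nn.Linear(256, 1000)  # 1000 training speakers

x = torch.randn(8, 200, 40)          # (batch, frames, acoustic features)
h, _ = encoder(x)                    # (8, 200, 256) shared representation

# Phonetic branch: CTC loss over per-frame log-probabilities.
log_probs = phone_head(h).log_softmax(-1).transpose(0, 1)   # (T, N, C)
targets = torch.randint(1, 42, (8, 30))                     # phone labels
input_lens = torch.full((8,), 200, dtype=torch.long)
target_lens = torch.full((8,), 30, dtype=torch.long)
ctc_loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lens, target_lens)

# Speaker branch: classify the mean-pooled utterance embedding.
speaker_logits = speaker_head(h.mean(dim=1))                # (8, 1000)
speaker_loss = nn.CrossEntropyLoss()(speaker_logits,
                                     torch.randint(0, 1000, (8,)))

# Joint objective; a task-weighting coefficient could be added here.
loss = ctc_loss + speaker_loss
loss.backward()
```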
Improving Voice Trigger Detection with Metric Learning
Voice trigger detection is an important task that enables activating a
voice assistant when a target user speaks a keyword phrase. A detector is
typically trained on speech data independent of speaker information and used
for the voice trigger detection task. However, such a speaker-independent voice
trigger detector typically suffers from performance degradation on speech from
underrepresented groups, such as accented speakers. In this work, we propose a
novel voice trigger detector that can use a small number of utterances from a
target speaker to improve detection accuracy. Our proposed model employs an
encoder-decoder architecture. While the encoder performs speaker-independent
voice trigger detection, similar to the conventional detector, the decoder
predicts a personalized embedding for each utterance. A personalized voice
trigger score is then obtained as a similarity score between the embeddings of
enrollment utterances and a test utterance. The personalized embedding allows
the model to adapt to the target speaker's speech when computing the voice
trigger score, hence improving detection accuracy. Experimental results show
that the proposed approach achieves a 38% relative reduction in the false
rejection rate (FRR) compared to a baseline speaker-independent voice trigger
model.
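The following is a minimal sketch, assuming per-utterance embeddings produced by the decoder, of the personalized scoring idea described above: a cosine similarity between an averaged enrollment embedding and the test-utterance embedding. The embedding dimension and number of enrollment utterances are illustrative.

```python
import numpy as np

def personalized_score(enroll_embeddings, test_embedding):
    """Cosine similarity between the enrollment centroid and a test utterance."""
    centroid = np.mean(enroll_embeddings, axis=0)
    return float(np.dot(centroid, test_embedding) /
                 (np.linalg.norm(centroid) * np.linalg.norm(test_embedding)))

rng = np.random.default_rng(0)
enrollment = rng.normal(size=(4, 128))        # e.g. four enrollment utterances
test = rng.normal(size=128)                   # embedding of a test utterance
print(personalized_score(enrollment, test))   # in [-1, 1]; higher = more similar
```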